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Probability Densities by Penalty Function Methods 

by 


( 1 ) 


G.F. de Monti’ieher ^ ^ , R.A. Tapia^^ and J. R. . Thompson 

ABSTRACT 


Except in the extreme case ’when it is known a priori exactly to which 
finite dimensional manifold the probability density function which 
gave rise to a set of samples belongs^the parametric maximum likeli- 
hood estimation procedure leads to poor estimates and is unstable; 
while the nonparametric maximum likelihood procedure is undefined. 
Good and Gaskins have recently suggested replacing the nonparametric 
maximum likelihood estimate with a nonparametric maximum penalized 
likelihood estimate; however they did not show that these estimates 
existed. In this paper we develop a very general theory of maximum 
penalized likelihood estimation which should avoid many of these 
present difficulties, We also demonstrate that each reproducing 
kernel Hilbert space leads, in a vety natural way, to a m?.yiw niii 
penalized likelihood estimator and that a well-known class of repro- 
ducing kernel Hilbert spaces gives polynomial splines as the non- 
parametric maximum penalized likelihood estimates. In addition 
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our general theory is used to s_iow that Good's and Gaskins' non- 
parametric maximum penalized likelihood estimators are well defined 
and that one of their estimators gives exponential splines as the 
estimates. Finally we show that Good's and Gaskins' method of im- 
plementation does nob in general lead to their estimators. 

1. Introduction . Let £1 he a subset of H 11 . In this study we consider the 
problem of estimating the probability density function cp € L 1 (q) which gave 
rise to the random samples 6 f). The set n may be either bounded 

or unbounded. 

As usual we define L(v), the likelihood that v 6 L 1 (q) gave rise 
to the samples x^, .* ,,x N tiy 

N 

(1.1) L(v)=nv(x.) , 

i=l 1 

Let H(q) be a manifold in L^(q) and consider the following optimization 
problem: 


( 1 . 2 ) 


maximize L(v) ; subject to 


v 6 H(n), ^vdn = 1 and v(t) > 0 V t . 


We let d^i denote the Lebesgue measure on Q . By the maximum likelihood \ 
estimator (corresponding to H(q) ) we mean the functional 

L*: n N -> 2 Ll(a > b) 

N A 

(A denotes the Nth Cartesian product of A with itself and 2 denotes the 

N 

subsets of A) which assigns to each {x^ ...jX^} ££1 the solutions of 
problem (l.2). Any v 6 L ( ) is said to be a maximum likelihood 
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estimate (of the probability density 9 ) for the samples (x^ . 

The maximum likelihood estimator L* is said to be well defined if ^(xj^, . . .,x^) 
consists of exactly one function (equivalently problem (1.2) possesses a 
unique solution). It is also usual to say that L* is a parametric estimator 
if the manifold H(q) is finite dimensional and a nonparametric estimator 
otherwise. 

It is well known and part of the folklore that the standard histogram 
estimates are parametric maximum likelihood estimates said that when H(q) is 
a finite dimensional linear manifold the corresponding maximum likelihood 
estimator is well defined. Except in the case when it is known a priori that 
9 € Il(fl) , it is generally true that the parametric maximum likelihood 
estimates are far from satisfactory. Moreover the nonparametric maximum 
likelihood estimator is essentially unde fined. Some justification for these 
latter two statements follows. 

Clearly if the manifold H(q) can approximate the Dirac delta 

function, i.e., contains nonnegative functions whose support is a given small 

sphere centered at x € 0 , integrate to one and have arbitrarily large values 

at x , then problem (l.l) has no solution. Moreover this approximation 

property is enjoyed by most infinite dimensional manifolds of ; hence 

vre should not expect the nonparametric maximum likelihood estimation problem 

to have a solution. The situation is actually worse for it is often Lhe case 

that in the parametric case we choose H(q) from a sequence of spaces (S ) 

05 1 

where the dimension of S is m , S z> S and US is dense in L (q); 

m ' m+l m t m 

m=l 

hence the problem is definitely unstable and somewhat ill defined. Namely we 
are motivated to choose m large so that we can better approximate the 
probability density giving r* to the samples x^, . . . ,x^ ; however for large n 
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our problem approximates a problem which has no solution. 

Rosenblatt [7] in 1956 performed the first analytical study of the 
theoretical properties of histograms. In 1962 Farzan constructed a class of 
estimators which properly included the histogram estimators and examined the 
consistency properties of the estimators in this class. These results have 
been improved upon recently by Wohba [10] (1971)* Kimeldorf and Wahba [3] in 
1970 introduced the application of spline techniques in contemporary statistics. 
Boneva, Kendall and Stefanov [1] in 1971 and Schnenberg [8] in 1972 examined 
the use of spline techniques for obtaining from hi stograms smooth estimates 
of a probability density function. It is of interest to us that essentially 
all previous authors seem to either ignore the nonnegativity constraint or 
attempt to handle it with the seemingly clever trick of working with a function 
whose square is to be used as the estimate of the probability densityj how- 
ever in the case of maximum likelihood estimation this trick tacitly ignores 
the noruiegativity constraint. More will be said about this in Sections 3 and 4. 

In 1971 Good and Gaskins [2] suggest adjoining a penalty term to the 
likelihood functional (l.l). They actually suggested two nonparametric 
maximum penalized likelihood estimators; however they di not show that these 
estimators v/ere meaningful, i.e,, well defined. Morsc er in dealing idth the 
nonnegativity constraint in problem (l.2), Good and Gaskins also fell into 
the trap described above of obtaining the estimate as the square of the solution 
of an optimization problem; hence Good's and Gaskins’ implementation does not, 
in general, give their estimator. 

In Section 2 we give a rigorous definition of the maximum penalized 
likelihood estimator. We also propose a very natural penalty term in the 
case when the underlying manifold is a reproducing kernel Hilbert space and 
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show that a very important and. well-known class of reproducing kernel Hilbert 
spaces gives rise to maximum penalized likelihood estimates which are poly- 
nomial splines with knots at the sample points. 

Sections 3 and 4 contain a rigorous analysis and proof of the fact 
that the Good and Gaskins maximum penalized likelihood estimators and their 
pseudo maximum penalized likelihood estimators obtained by their incorrect 
method of implementation are well defined and in the first case identical, 
but in the second case distinct. It is also of interest that in Section 3 
we show that Good's and Gaskins' first nonparametric maximum penalized likeli- 
hood estimator leads to estimates which are exponential splines with knots at 
the sample points. 

/ 

Much of our analysis uses the notions of the Frechet gradient, the 
Frechet derivative and the second Frechet derivative ir. an abstract Hilbert 
space. The reader not familiar with these notions is referred to Tapia f 93 • 
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2. Maximum Penalized Likelihood Estimators . In order to avoid the pitfalls 
and numerical instabilities attributed to the presently used maximum likeli- 
hood estimation procedures we suggest adjoining a penalty term to the like- 
lihood functional. 

Let H(fi) be a manifold of real-valued functions defined and 

n 1 

integrable on a set Q cr R , i.e., H(fi) C L (O’) . Consider a functional 
§:H(q) R . Given the samples x^,..,,x { j 6 0 we define the $ ^penalized 
likelihood of v 6 H(q) by 

H 

(2.1) L(v) = n v(x, ) exp (-§(v)) . 

i=l 1 

Consider the constrained optimization problem: 

A 

(2.2) maximize L(v); subject bo 

v € H(0) , ^vd|i = 1 and v(t) > 0, V t M . 

The maximum penalized likelihood estimator L corresponding to 
the set H(n) and the penalty function $ is defined in a manner analogous 
to the definition of the maximum likelihood estimator given in Section 1, 
using the solutions of problem (2.2). The term- parametric, the term 
nonparametric and the term . well defined have the same meaning in this 
context as in Section 1. For the remainder of the paper we consider the 
nonparametric case of the maximum penalized likelihood estimator; specifi- 
cally we will choose H(fl) to be either an infinite dimensional. Hilbert space 
or an infinite dimensional manifold in a Hilbert space. In the case when 
H(ft) is a Hilbert space a very natural penalty function to use is $(v) = 

Hvl| where ||*Jj denotes the norm on H(fi) , Consequently when H(o) is a 
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Hilbert space and we refer to the penalized likelihood functional on H(ft) 
or to the maximum penalized likelihood estimator corresponding to H(o) 
with no reference to the penalty functional 5 we are assuming that f is 
the square of the norm in H(fl) , Recall that when H(fi) is a Hilbert 
space it is said to be a reproducing kernel space if point evaluation is a 
continuous operation, i.e., v n -> v in H(ft) implies v r (x) -» v(x) V x € n. 

In order for problem (2.2) to make sense we would like H(fi) to 
have the property that for each (x^,...,x N ) € 0 there exists at least 
one v € H(fi) such that 


(2.3) [vdu - 1, v't) > 0 V t € Q and v(x. ) >0 i - 1,...,N, 

h 1 

Proposition 2.1 . Suppose that H(q) is a reproducing kernel space and D 
is a closed convex subset of {v 6 H(q)s v(x^) > 0) with the property that 
D contains at least one function which is positive at the points x^, ...,x^. 
Then the penalized likelihood functional, on H(n) has a unique maximizer in 
D. 

Proof . Since H(fi) is a reproducing kernel space we have |v(x^)| < K^llvjl 
for i = 1,..,,N. It follows that 


(2.4) 


lL(v)| exp (-l|vt| 2 ) 


N 


The function 0(x) = \ K exp (-X^) is bounded above by (N/2) 2 exp (-N/2); 
hence |L(v)j < C g , If M = sup[L(v):v € D) , then there exists (v^} c D 
such that L(v^) M . From our hypothesis M > 0. Notice that 6(x) -» 0 
as X -> . Hence from (2.4) l|v^|| < Y j The ball (v 6 H(fl): 

ll v ll < C J is weakly contact. Hence {v.) contains a weakly convergent 
subsequence which we also denote by (v^ ) . let v* denote the weak limit 
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of tvj). We have that v^(x^) -* v*(x^) as j ■+ •» for each i = 1,,..,N. 
The norm is a convex functional; hence weakly lower semicontinuous so that 
lim llvjll > Hv1| . It follows that 

N N 

(2.5) limn v.(x, ) exp(-!|v.|| 2 ) <H v*(x. ) exp(-||v*|| 2 ) . 
j i=l J 0 i=l 

However the left-hand side of (2.5) is equal to M and the right-hand side 
is equal to L(v*); so M < L(v x ‘). Now since D is closed and convex it is 
weakly closed; hence v* € I>. This establishes the existence of a maximizer. 

A 

Since M > 0, maximizing L over D is equivalent to maximizing 

s* t 

J log L over D. A straightforward calculation gives the second Frechet 
derivative of J as 


N |i(x. )T](x. ) 

J"(v)(|i,Tl) = - £ - 2 < M > • 

i=l vix^) 


Now since J"(v) is negative definite J is strictly concave and can there- 
fore have at most one maximizer on a convex set. This proves the proposition. 
Proposition 2,2 . Suppose H(q) is a reproducing kernel spac«., integration over 
n is a continuous functional and there exists at least one v € H(fl) 
satisfying (2,3\ Then the maximum penalized likelihood estimator corresponding 
to H(n) is well defined, & 

Proof . The proof follows from Proposition 2.1 since the constraints in (2.2) 
give a closed convex subset of (v € H(fi) : v(x^) > 0, i = l,...,*!} , 

Recall that by the Sobolev space of order s on the real line we 

mean 


& 

(2.6) H s (-»,w) = € S’ :(l+u> 2 ) 2 F[^] (tt) € L 2 (-«,&>)) 


where S' is the space of distributions with polynomial increase at infinity 
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and F[u) denotes the Fourier transform of The norm of u € H s (-»,®) 
is given by 


(2.7) |lu||. « ||(V-‘.» 8 ) 2 P[u](a))ll g 

H (-oo,cd) II ( -COjCt) 

If s is an integer, then u £ H“(-ro,») if and only if u,u^ , . . . ,u^ £ 


L (-<»,co) and an equivalent norm is given by 

s 
£ 

i-0 


1 

(2.8) I S*ill« (1, l£, -> , p) ] 5 


where w. > 0 and w^.w >0. We have the analogous definitions in the case 
of the finite interval; however when considering the Fourier transform we 
must extend the functtoi. to the entire interval (-oo,®) . As in the pre- 
vious section the notation H s (a,b) does not preclude the possibility that 
either a or b (or both) may be infinite. The reader interested in more 
detail is refered to Lions and Magenes [5] . 

Lemma 2.3 . The Sobolev space H s (a,b) is a reproducing kernel space if and 
only if s > ^ . Moreover for s > ~ the linear functional I:H s (a,b) -* R 
defined by 


b 

I(v) = jV(t)dt 

a 

is continuous if and only if (a,b] is a finite interval. 

Proof . The proof follows in a reasonably straightforward manner using results 
in Lions [ 3] . 

Proposition 2.U , The maximum penalized likelihood estimator corresponding 
to the Hilbert space H s (a,b) where s > ~ and [a,b] is a finite interval 
containing the sample points is well defined. 

Proof . The proof follows from Proposition 2.2 and Lemma 2.3. 

Recall that if s is an integer, then 
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H®(a,b) « {u € H s (a,b):u (k ^(a) = u^ k) (b) « 0, k * 0,...,s-l} 

Lot HQ(a,b) b". the collection of functions in H®(a,b) with the Hilbert 
space structure induced instead by the inner product 

(2,9) < u,v > - Ju^ s M S ^ . 

a 

It can be shown that H^^b) and H®(a,b) are equivalent, l.e., have the 
same topology, in a manner similar to that which shows that (2,7) and (2.8) 
are equivalent. Clearly H®(a,b) and H°(a,b) do not have the same inner 
product . 

Theorem 2,5 . Suppose (a,b) is a finite interval properly containing the 
sample points , Let s be a positive integer. Then the maximum 

penalized likelihood estimator corresponding to H®(a,b) is well defined 
and gives as an estimate a polynomial spline of degree 2s . Moreover, if 
the estimate is positive in the interior of an interval, then in this interval 
it is a polynomial spline of degree 2s and of continuity class 2s-2 with 
knots exactly at the sample points. 

Proof . Clearly H®(a,b) is a reproducing kernel Hilbert space since 
H®(a,b) xs such a space. It follows that the maximum penalized likelihood 
estimator corresponding to H^(a,b) is well defined from Proposition 2.2. 

Consider an interval I + - [o,0] c [ a,b] , Let I_ ® ft € [a,b]: 
t j t la,Pl) . Define the two functionals J + and J on H®(a,b) by 

J (v) = £ log v(x. ) - JV(t) 2 dt 

i < 

and 

J (v) = Z log v(x. ) - Jv(t) 2 dt , 

i 1 r 
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where the summation in the first formula is taken over all i such that 
x^ € I + and the summation in the second formula is taken over all i such 
that x^ € I_ .It should be clear that 

J(v) = J + (v) + J_(v) 

where as before J(v) - log £(v) and is the penalized likelihood in 
H^Cayb) , Let V* denote the maximum penalized likelihood estimate for the 
samples x^,.,»,x^ . Suppose V* is positive on the interval I + . We 
claim that V # restricted to this interval solves the following constrained 
optimization problem: 


maximize J + (v)j subject to 

('T l") v e H s (a,b) , v (m) (a) » vj m) ( ff ) , v (m) (p) = v^(p) , 

u ~ 0| * * * — 1 j 

fv(t)dt = JV*(t)dt and v(t) < 0 , t € I + . 

To see this observe that if v + satisfies the constraints of problem (2.10) 

* 

and J H (v. x .) < J H .(v + ) , then the function v defined - by 

( v + (t) , t € I + 

v # (t) , t € I_ 


satisfies the constraints of problem (2.2) with H®(a,b) playing the role of 
H(n) and J(v*) = J + (v*) + J_(v*) < J + (v + ) + J_(v + ) = J(v ) ^ which in turn 
implies that L(y # ) < L(v ); however this contradicts the optimality of v . 
How define the functional G on by 
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G(v) = J + (v + + v) for v € Hq(o,P) . 
Consider the constrained optimization problem 


maximize G(v) ; subject to 


( 2 . 11 ) 


v 6 and | v = 0 . 

If v satisfies the constraints of problem (2.1l), then v* + tv satisfies 
the constraints of problem (2.10) for t sufficiently small, since v* is 
positive in I + . It follows that the zero function is the unique solution 
of problem (2.11). From the theory of Lagrange multipliers we therefore must 
have 


( 2 . 0 2 ) 


VG‘,0) + \v Q - 0 , 


where X is a real number, VG(o) is the Frechet gradient of G at 0 and v Q 
is the Frechet gradient of the functional v - JV in the space H®(a,{j) . 
Clearly in this case v Q is merely the Riesz representer of the functional 


v -* Jv . 
Specifically 


I* (s) (s) f. 

! v 6 v - 


(2s) 

Integrating by parts in the distribution sense we see that v^ = 1; hence Vq 
is a polynomial, of degree 2s in la,p] . A straightforward calculation shows 


that 

(2.13) 



where the summation is taken over i such that € I + and v^ is the Riesz 
representer of the functional v -* v(x^) H q S ^(°>P) j i.e.» 
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jyuy-o „ v ( Xi ) . 

(2s) 

As before integrating by ports in the distribution sense we t 4 e that v^ - 6^ 
where 6^ is the Dirac mass at the point x^ . It follows that v^ is a 
polynomial spline of degree 2s-l and of continuity class 2s-2 with a 
knot exactly at the sample point x^ . From (2.12) and (2.13) we have that 
v* restricted to the interval [of,p] is a polynomial spline of degree 2s 

and of continuity class 2s -2 with knots exactly at the sample points in 

[a,p] . A simple continuity argument takes care of the case when v* is 
only positive on the interior of [a,p] . This proves the theorem. 

Remark. Observe that Theorem 2.5 says that the spline estimate is necessarily 
zero at knots which are not sample points. 

In the case when s=l we can say substantially more about the 
distribution of the knots and zeros of the spline estimate, 

Theorem 2.6 . Suppose (a,b) is a finite interval properly containing the sample 
points x^j • Then the maximum penalized likelihood estimator corres- 

ponding to Hj(a,b) is well defined and gives as an estimate a continuous 
quadratic spline with knots at the sample points and at most two knots in the 

interior of each interval [x^,x^ +1 l > i = 0, ...,N + 1 (x Q - a and x^ +1 = b) . 

Moreover in each such interval the spline is either zero at no points, zero 
at one point (which must be a knot) or zero on a proper subinterval whose end- 
points are necessarily knots. 

Proof . Suppose the estimate v* is zero at a and p where x^ < a < P < 
and not identically zero in (or, Pi . Consider the function 



t 4 ta,p] 
t € t of,P ] . 


| 


-lit- 


r j *. 

t* “ 


Ci early y = l/Jv > 1 . We also have that 


a 




j(ofV ) > J(v ) > J(Vj-) 


and that yv 6 H^(a,b) , yv (t) > 0 for t € [a,b] and J’yv = 1 , This, 

a 

however, contradicts the optimality of v* and shows that v* must be iden- 
tically zero in the interval [ar,pj . The remainder of the theorem follows 


from Theorem 2,5 and the remark following it 
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3. The First Maximum Penalized Likelihood Estimator of Good and Gaskins. 

In [2] Good and Gaskins consider the maximum penalized likelihood estimator 

corresponding to the penalty function 

2 

i ± W) - a J* dt (or > o) . 

They do not define the manifold H(fi) j but it is obvious from the constraints 
that must be satisfied and the fact that 



-W 


'*> that the underlying manifold H(q) should be 


= (v € L’ 1 ' : ^/v € (-»,») ) . 


This leads us to analyzing the following constrained optimization problem: 


(3.1) maximize L (v) = n v(x. )exp(-$. (v) ) j subject to 

X i=l 1 

v € t JV(t)dt = 1 and v(t) >0 V t 6C-®,®) 

-00 

In an effort to avoid the nonnegativity constraint in problem (3.1) 
Good and Gaskins considered working with the instead of v. Speci- 

fically if we let u = Jv , then restating problem (3.1) in terms of u we 
obtain 

N 2 

maximize n u(x.) exp^^n’ (t) dt) ; subject to 
i=l -<*> 


(3.2) 
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u € H 3 ^-®,®), J'u(t) 2 dt = 1 and u(tf > 0, t € (**•»••) . 

-«0 

o 

Since the constraint u(t) >0 is redundantthey suggest solving problem 

(3.2) for u # and then accepting v # = u # as the solution of problem (3.l). 

On first impressions everything looks fine; however & moments reflection 
should convince the reader that what tacitly has been assumed is that the 
unique solution of problem ( 3*2 ) is actually nonnegative. Hence adding the 
nonnegativity constraint to problem (3.2) and restating in the equivalent 
form obtained by taking the square root of the objective functional (since 

it is nonnegative) we arrive at the following constrained optimization problem: 

* N 

(3.3) maximize L(v) = n v(x. )exp(-5(v) ); subject to 

i=l 1 

09 

v C* H^( -*»,«*) , J'v(t) 2 dt = 1 and v(t) > 0, Y t € (-“>>») 

mtO 

where 

2 

$(v) = 2ofj*v , (t) dt 
-00 

and a is given in problem (3*1). 

Proposition 3.1 . Problems (3.1) and (3.3) are equivalent in the sense that 
if v # is a solution of problem (3.1), then */v* is a solution of problem 

(3.3) and if v* is a solution of problem (3.3 ) , then v* is a solution of 
problem (3.1). 

Proof . The proof follows from the fact that if v > 0, then 

$Cv£) = |? x (v) 

and 

LjM = L(v^) 2 . 


It is very surprising and quite fortunate that Good’s and Gaskins' 
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omission does not really effect this estimator; since we will presently 
show that the nonnegativity constraint in problem (3.3) is not active at 
the solution, i.e., problems (3.2) and (3-3) actually have the same solu- 
tions. Unfortunately this will not be the case for the second maximum 
penalized likelihood estimator Good and Gaskins propose- Good and Gaskins 
did not show that their estiiiacors are well defined; hence this is our first 
task. Along with problem (3.3) we will consider the constrained op- 
timization problem obtained by only requiring nonnegativity at the sample 
points: 

A 

(3.4) maximize L(v) ; subject to 

CO 

v € H^(~»,«o) , J'v(t) dt -- 1 and v(x.) >0, i = 1,...,N . 

-CO 

Given X > 0 and <x in problem (3.3) we may also consider the 
constrained optimization problem: 

^ N 

(3.5) maximize L (v) = n v(x. )exp(-$ (v)) ; subject to 

k i=l 1 k 

CO 

v € H^-cojOo) , J‘v(t) 2 dt = 1 and v(x^) >0, i = 1,...,N 
-00 

where 

00 (O * 

$^(v) = 2ccjv'(t) 2 dt + Xj’v(t) dt . 

-00 -00 

Our study of problem (3.5) will begin with the study of the following 
constrained optimization problem: ' 

t, 3.6) maximize L^(v); subject to 

v € l^t and vCk^) > 0 , i = 

where is gives, 'fcy jyf&lem (3.5). , T/et L 2 = L 2 


Proposition 3.2 . Problem (3*6) has a unique solution. Moreover if 
denotes this solution, then 
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(i) is an exponential spline with knots at the sample points 

x ^» •"> x j{i 

(ii) v^(t) >0 , V t 6 (- co ,®); and 

(iii) livji 2 > VH/(4XJ . 

L 

Proof . From Lemma 2.3 fh -»,<») is a reproducing kernel space. Also 

Ml ~ 5^( v ) gives a norm equivalent to the original norm on If 1 ^ -<»,<») , 

The existence of v^ now follows from Proposition 2.1 with D = {v 6 H^( -»,<») 

v(x ) > 0, i = 1,...,N). We will denote the inner product by < > > . 

**• K A 

Let v i be the representer in the inner product of the continuous 
linear functional given by point evaluation at the point i = 1,..,,N, i.e 

< V 71 > X = ’ V 71 € Iil (^ 0 » eo ) • 


Equivalently 


CO 00 

2<yJV{(t)lT(t)dt + Xjv i (t)7l(t)dt = Tl(x.) , V T\ € . 

~Q> *03 

Integrating by parts in the distribution sense gives 

oo 

J|-2orv”(t) + Xv i (t)) Tj(t)dt - Tl( Xi ) , V T) € -»,«.) ; 


hence 


(3.7) 


- 2orv" • Xv A = 6 j[ , 


i = 1,...,N 


where 6 i (t) = 6 Q (t-x^) and 6 Q denotes the Dirac distribution, i.e., 

00 

J 6^(t )TJ (t ) dt = 7] Co) . If we let v Q be the solution of (3-7) for i = 0, 


then 
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V*) - 


ex pC^7(2^yt) 

exp ( ->Jkf{ 2o)’t ) 


t < 0 


t > 0 


and v^t) = Vgft-x.^) for i = 1 ,.,,,N. Since is the maximizer we 

have that v. (x. ) > 0 , i - 1, . . . ,11 we necessarily have that the Frechet 
K 1 

A 

derivative of at v^ must he the zero functional 5 equivalently the 

gradient of or for that matter the gradient of log must vanish 

A A 

at v^ since and log have the same maxima, A calculation similar 

to that used in the proof of Proposition 2,1 gives 

(3.8) 7 x l0g ^(V) = 2V -.^7^ 

where denotes the gradient. It follows from ( 3 . 8 ) that 


( 3 . 9 ) 


, N v. 

V X 2 ^ v ( x ) 

A i=n u i ; 


Properties (i) and (ii) are now immediate. Since < v^,v^ = v^(x^) 

from (3*9) we have 


(3.10) 


!|v x llj * N/2 


A straightforward calculation shows that 


So 


v|(t)vj(t) < |^ v^.CtKjCt) , for i,j =1,...,N 
p 1 v ±^ P v'(t)v'Ct) 

^ (t) ■ * [ ? ( w> ^ 1 


v i (t) 


* -j _ V (t)v (t) 

< BJ ie( 7 T 3 T)^ e vTx )v" 7 x T 3 

i V V V W V 


.-^2 


2 or X 


(*) 



Integrating in t gives 


2 a|lv'|l 


■{ir 2 , > • 


By definition of the §^-norm and (3.10) we have property (iii). This proves 
the proposition. 

Proposition 3.3 . Problem (3-4) has a unique solution. 

CD 

Proof . Let B - (v € H 1 ( -»,«): J'v(t) 2 dt < 1 and v(x^) >0, i = 1,...,N). 

*00 A 

Clearly B is closed and convex. If is given by (3.5), then by Pro- 
position 2.1 the functional has a unique maximizer in Bj say u^ . How by 
property (iii) of Proposition 3.2 if we choose 0 < X < ^ , then v^ the 

unique solution of problem (3-6) will be such that ||v^|| 2 > 1. We will 

L (-00,®) 

show that for this range of X,l|u^|| 2 = 1 . Consider v^ = 0v^ + 

(l-9)u^. We know that log is a strictly concave functional (see the 
proof of proposition 2.1). Moreover log L^(v^) > log L^(u^);hence log L^(v^) > 
log L^(u^) for 0 < 6 < 1 . Now suppose ||u^|l 2 <1 and consider 

g(e) - llv 0 ll , . 

L 

We have g(o) < 1 and g(l) > 1. So for some 0 < 0 Q < 1, b(&q) = 1 and 

log L^(u^) < log 1 ^( v q )• This is a contradiction since u^ is the unique 
^ 0 

maximizer of in Bj hence l|u^|l 2 =1, This shows that u^ is 

L (-®,®) ^ 

the unique solution of problem (3.5) for 0 < X < -jj . However, the term X 

JV(t) dt is constant over the constraint set in problems (3.4) and (3.5); 

*00 

hence problems (3.4) and (3.5) have the same solutions for any X > 0 . 

This proves the proposition since we have demonstrated that problem ( 3 . 3 ) has 
a unique solution for at least one X . 

Proposition 3.** . Problem (3.3) has a unique solution which is positive and 
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an exponential spline with knots at the points x^,...,x N . 

Proof . If we can demonstrate that v the unique solution of problem (3-*0 
has these properties we will be through. Let G(v) = log L(v) where L 
is given in problem C 3 • 3^ let 

g(v) - Jv(t) 2 dt 
-» 

for v € Clearly v(x^) > 0 for i - 1,...,N; hence from the theor;.- 

ru 

of Lagrange multipliers there exist X such that v satisfies the equations 
(3.H) G’(v) - Xg'(v) = 0 and g(v) - 1. 

Using L 2 (-m,oo) gradients in the sense of distributions (3.1l) is equivalent, 
to 

N 6. 

(3.12) - W+ sxv * r vTv" T e(v) s i 

i=l v ' x i' 

CO 

where 6. is the distribution such that j\r(t)G^(t)dt - v(x^), i - 1,...,U. 
Since we have already established that problem (3«M has a unique solution 
it follows that (3.12) must have a unique solution in namely v. 

If X < 0, then any solution of the first equation in (3.12) would be a sum 
of trigonemetric functions and could not possibly satisfy the constraint 

p 

g(v) * 1, i.e., can not be contained in L (-“,'■’). It follows that X > 0. 

Now observe that 

G - Xg = log L^ 

where 1^ is given by problem (3.5) » hence if v satisfies (3-H) (from 
the first equation alone) it must also be a solution of problem (3-6) for 
this X and therefore has the desired properties according to Proposition 
3.2. This proves the proposition. 



Proposition 3.5 . The first nonpar ametric maximum penalized likelihood 
estimator of Good and Gaskins is well defined; specifically tii maximum 
penalized likelihood estimator corresponding to the penalty function 
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m 2 

$(v) * Of j* dt (a > 0) 

—CO 

and the manifold 

H(0) = (v € L 1 (-«,«): y'T'e A — ,-)} 

is well defined. Moreover the estimate for the sample points x^,...,Xy 
given by this estimator is positive and an exponential spline with knots 
at the sample points. 

rv 2 

Proof # From Proposition 3*1 this estimate is v where v solves problem 

(3-3). %■ Proposition 3.^ v is positive and an exponential spline with 
. ~2 

knots at x^,...,x^; hence so is v . This proves the proposition. 



- 23 - 


4, The Second Maximum Penalized Likelihood. Estimator of Good and Gaskins . 
Consider the functional SjH 2 {-«>,») -> R defined by 

(4.1) §(v) - a j’v'(t) 2 dt + p JV'(t) 2 dt 

-00 -CO 

for some a > 0 and p > 0 . Also consider the functional defined on 
- (v 6 -«,<»): € ^(-co,®)) hy 

(4.2) ^(v) ~ § (*A) 

where $ is given by (4.1). By the second maximum penalized likelihood 
estimator of Good and Gaskins we mean the estimator corresponding to the 
manifold Jifi (-<»,«) and the penalty function . Hence we must consider 
the following constrained optimization problem: 


(4.3) maximize L^v) = II v(x i )exp(-$ 1 (v)) j subject to 

i=l x 

v € v^W) , JV(t)dt = 1 and v(t) > C V t € (- 00 ,®) . 

—CD 

As in the first case (described in the previous section) Good end Gaskins suggest 
avoiding the nonnegativity constraint by calculating the solution of problem 

(4.3) from the following constrained optimization problem: 

A N 

(4.4) maximize L(v) = II v(x. )exp(- p$(v)); subject to 

i=l 

v € Hf* ( •* j® ) $ J*v(t) dt = 1 and v(x^) > Oj i — 1>< * * 

-CO 

whero $ is given by (4.1). 


Clearly problems (4.3) and (4.4) are equivalent in the sense that 
the solution of one can be obtained from the solution of the other by either 
taking the square or square root if and only if the solutions of problem 
(4.4) are nonnegative. Moreover we will presently demonstrate that the solu- 
tions of problem (4.4) are not necessarily nonnegative. It will then follow 
that we can not obtain the second estimator by considering problem (4.4). 

If we naively use v* 9 where v^ solves problem (4,4), as an esti- 
mate for the probability density function giving rise to the. samples 

2 

x^, . . . ,x^ , then clearly v* will be nonnegative and integrate to 1 and 
is therefore a probability density; however the estimator obtained in this 
manner will not in the strict sense of our definition be a maximum penalized 
likelihood estimator. For this reason we will refer to this latter esti- 
mator as the pseudo maximum penalized likelihood estimator of Good and Gaskins. 
The next six propositions are needed to show that the second maximum 
penalized likelihood estimator and the pseudo maximum penalized likelihood 
estimator of Good and Gaskins are distinct ana <, *11 defined. 

Proposition 4.1 . The second maximum penalized likelihood estimator and the 
pseudo maximum likelihood estimator of Good and Gaskins are distinct. 

Proof. We will show that it is possible for problem (4.4) to have solutions 


which are not nonnegative. Toward this end let II = 1, 0, or = 0 and 



g(v> . ;v(t) 2 dt . 


As in the proof of Prop obit ion 3.4 using the theory of distributions and the 
theory of Lagrange multipliers we see that the solutions of problem (4.4) 
in this case are exactly the solutions of 

(4.5) v ^ iv ^ i Xv = gvToT 40(1 g ( y ) = 1 

where 6^ is defined in the proof of Proposition 3.4, If we let v denote 
the Fourier transform of v, then taking the Fourier transform of the first 
expression in (4.5) gives 

v(o>) = [2v(o)(x + lSrtV')]*' 1 . 


Since ||v|( 2 = ||y|| 2 = 1 we must have 

L (-eojco) L (-«,») 


(4.6) 


J 


d Vr5 


(x+i6n tu' 


For the integral in (4.6) to exist we mast have X > 0 . How the inverse 

. , 4 -1 

Fourier transform of (X+l6n a) ) is given by v where 


(4.7) v(t) = 



[cos bt - sin bt] 


l cos bt + sin bt] 


t < 0 


t > 0 


with b - X^/a/T. From (4.7) v(o) = (8b^) _1 and from (4.6) v(0) c = 


iA t 

Z X K 
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1 

where K = ||(l+l6nV*r 1 || 2 5l . Hence X^= 2K and b = . It follows 

IT( — ,«) 

that the unique solution of problem (4.4) is given by (4.7) with b = v^K 
which is clearly not nonnegative. This proves the proposition. 

We will devote the remainder of this section to showing that both 
the , second estimator and the pseudo estimator are well defined. The 
approach taken will be very similar to that used in Section 3 to show that 
the first estimator of Good and Gaskins is well defined. 

Given X > 0 consider the constrained optimization problem: 

a N 

(4.8) maximize I^(v) = n v(x^)exp(-$^(v) ) ; subject to 

CO 

v € H 2 (-«,*) , JV(t) dt = 1 and v^) >0, i * 

-CO 

where 

00 

# x (v) = 7T $(v) + xj‘v(t) 2 dt 
with §(v) given by (4,l). 

As before we also consider the constrained optimization problem 
obtained by dropping the integral constraint: 

(4.9) maximize L^(v) ; subject to 

v 6 I^(-®,ob) and vtx^ > 0 , i = 1,.,.,N. 

Proposition 4 . 2 . Problem (4.9) has a unique solution. Moreover if v^ 
denotes this solution, then 


IK II o -> + » as X •* 0 . 

L (^»,eo) 

Proof . By lemma 2.3 the Sobolev space H 2 (-<»,co) is a reproducing kernel 


space. Moreover, if 
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11< = $ x (v) , 


then an integration by parts gives 


(4.10) 


llv'll p - 1< V,V M > 2 1 < llvll g|lv"|| 2 
IT IT IT if 


< fUM| 2 2 + l|v"l! 2 2 l 


2 2 

where L denotes L (-«>,«)• hence ||*|| is equivalent to the original 
norm on ■ h -«o,a5 ) . The existence and uniqueness of v^ now follows from 
Proposition 2.1. 

We must now show that ||v. || 0 + » as X *-> 0. From the funda- 


mental theorem of calculus we have 

..2 

mi 

dv 

(4.11) 


v(x) 2 = J* dt - 2 J* v(t)v' (t)dt 


<2iivii ginivii 2 . 

if if 

Also, l|v ,, || ? < MJJF so that from (4.10) and (4.1l) 
L “ K 


( 4 . 12 ) 


2 

v(x) 2 < 2 llvl! 2 2 J|v|| /vf . 

if x 


Evaluating (4.12) at x^, taking logs (since v(x^) > 0) and summing over i 
gives 

N 
1 
i= 

Hence from (4.13) we see that 


(4.13) 2 log V ^ X i^ -5 log ( V^' V "x^ + iF 1 o b(I|v||^ 2 ) • 


(4.14) log I^(v) < ^ log(llv|| 2 ) + | log( l|v|l x ) - ||vli 2 • 


In a manner exactly the same as that used to establish (3.10) we have that 



- 28 - 


V 

we obtain 

(4.15) log I^(v) < log(||v x !| 2 ) + | log(8N/3) - | , 

L 

for any (uf ^(-oo,w): u(x^) >0, i = 1,...,N) . 

Let a and b be such that 

a < min(x, ) and max(x. ) < b . 
i 1 i 

Given X > 0 and e and 6 define the function 0^ in the following piece 
wise fashion: 



f~ X e exp(-(t-a) 2 /2o 2 ) 

for 

t 6 (-«>a) 

S x (t) = • 

) x « 
) 

for 

t € la,b] 


/ X € exp(-(t-b) 2 /2a 2 ) 

for 

t 6 (b,-^») 


where a - X^ . Straightforward calculations can be used to show 

N 

log( n e,(x. )) = eNlog(x) , 

. i=l K x 

||e x l| 2 2 = (b-a)x 2e + v?fx 2e+6 , 

\\Q'f = ^fx 2e " 6 , 

K L 

lle x ll 2 2 = a^nx 2 ®" 36 , 

K IT 

and 

(4.16) ||S X H 2 = (b-a)X 2e+1 + v^X 2e+5+1 + 4oK/SlX 2e " 6 + 2p^ifx 2 ®" 36 . 

2 

If we want *♦ 0 as X -* 0 it is sufficient to choose all exponents 

of X in (4,l6) positive. If we also want 


||v x ll 2 = -■ , Hence from (4.14) and the fact that log I^(v) < log I^( 
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N 

log( n 0. (x, ) ) -> + » a s X -> 0 
i=l A 1 

we should choose e < 0 . This leads to the inequalities 


2s + 1 > 0 
2c + 6 + 1 > 0 
(4.17) 2e - 6 > 0 

2c - 36 > 0 


e < 0 . 


The system of inequalities (4,17) has solutions; specifically e - - ^ and 

6 = - ^ is one such solution. With this choice of c and 6 we see that 

log L^(o^) •» + co as X -* 0 . It follows from (4.15) by choosing v = 0^ 

that ||v |l ^-> + oo as X -» 0 . This proves the proposition. 

K if 

Proposition 4,3 « Problem (4,8) has a unique solution. 

Proof . By Proposition 4.2 there exists X > 0 such that if v^ is the unique 

solution of problem (4.9), then ||v. || 2 > 1 . Now, if B = (v 6 I? (-»,«>): 

® 2 h 

J'v(t) at < l) and v(x^) >0, i = 1, ...,N), then B is closed and convex. 

-CO 

The proof of the proposition is now exactly the same as the proof of Proposition 


3.3. 

Proposition 4.4 . The pseudo maximum penalized likelihood estimator of Good 
S,nd Gaskins is well defined. 

Proof . Since problems (4.4) and (4.8) have the same solutions the proposition 
follows from Proposition 4.3. 

By the change of unknown function v •* Jv we see that problem (4.3) 
is equivalent to the following constrained optimization problem: 

A N 

maximize L(v) = H v(x. )e A p(-^$(v)); subject to 
1=1 1 


(4.18) 


where $(v) is given by (4.1). 

In turn for X > 0 problem (4.l8) is equivalent to 

(4.19) maximize L^(v); subject to 

v € H 2 (-»,=») , Jv(t) 2 dt = 1 and v(t) > 0 V t 6 (-«.,<*>) 

•CO 

where L is defined in problem (4.8). 

A 

As in the previous two cases we also consider the constrained opti- 
mization problem: 

/■s 

(4.20) maximize L^(v)i subject to 

v € and v(t) >0 V t 6 (-»,<») 

where L^(v) is defined in problem (4.8). 

Proposition 4.$ . Problem (4.20) has a unique solution. Moreover if v. de- 

A 

notes this solution, then 

llv^li a"* 4 ' 10 as * 0 • 

L 

+ 

Proof . The existence of v^ follows from Proposition 2.1 as in the proof 
of Proposition 4.2. Let us first show that 

(4.21) Itv*ll x < JSJT . 

From Lions 14, p. 9 ] we see that 

(4.22) ^(v*)Cn-v*) < 0 

for ail nonnegative T1 in H 2 ( -» ,w ) ■ We have 


hence 


i=l 


i 


(4.23) %(v*)(v*) = » - 2 ll v x ll x • 

Now choosing T| - 0 in (4.22) and using (4.23) we arrive at (4,21). The 
functions 0^ defined in the proof of Proposition 4,2 satisfy the con- 
straints of this problem; hence 

log \(e x ) < log \( y x ) * 

From (4.14) and (4.21) ve have 

(4.24) log ^(e^) < ^ logdlv^ll 2 ) + § log(8N/fl) + | . 

L “ 

Ihe proof now follows from (4.24) since log L^(S^) ■+ + •* as X -» 0 . 
Proposition ^ ,6 . The second maximum penalized estimator of Good and Gaskins 
is well defined. 

Proof . Using Proposition 4.5 the argument used to prove Proposition 4.3 
shows that problem (4.19) ha* a unique solution which is also the unique 
solution of problem (4.18) . This proves the proposition. 

The authors would like to thank Professors B.F. Jones, P.E. Pfeiffer 
and W.A. Veech for helpful discussions. They would also like to thank the 
referee for helpful suggestions. 
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APPENDIX 

Numerical implementation 

We wish to implement numerically the maximum penalized likelihood 
estimator corresponding to the reproducing kernel space H^(a,b) discussed 
in Section 2. Toward this end we introduce a partition of the interval (&,!>); 


a = t rt < t, < t < t, , = b , 

0 1 m m+1 


where the mesh spacing is equal to h - (b-a)/m for some predetermined positive 

integer m , We let denote the value of the discrete solution at the mesh 

point t^ . Clearly since we are approximating elements in H^(a,b) we will 

require that y^ ~ y ^ = ^ * We c ^ loose as a discrete approximation to the 

derivative at the mesh point t^ the first forward difference (y i+1 - y^)/h. 

As the discrete form of the integral constraint we choose the trapezoidal rule, 

m ^ 

which in this case leads to £ y. = h - . Given the samples x 1 ,...,Xjj € ta,b] 

i~l 1 h h 

let denote the number of samples in the interval [t^ - 2*^1 + 2 ^ for 

i = 2,...,m -1, let denote the number of samples in [a,t^ + £) and 

finally let a denote the number of samples in the interval ft - ^ ,b] , 
m me' 

Our discrete maximum penalized likelihood estimate is obtained as the solution 

of the following constrained finite dimensional optimization problem: 

m a. _ 2 m 2 

maximize J(y, ,...,y ) « n y. exp(-h" 2 (y. . - y , ) ) j 

m i= i i i=0 x x 


subject to 


-1 

2 y. = h and y. >0 , i = l,..,,m . 
1=1 


The fact that this optimization problem has a unique solution follows as in the 
proof of Proposition 2.1. Figure 1 shows our numerical results when this 
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procedure was applied to 100 samples obtained from the uniform distribution 
and Figure 2 shows the result obtained when this procedure was applied to 
100 samples obtained from the Gaussian distribution. Since the curves are 
only described at the mesh points we have interpolated linearly between 
every two mesh points. 
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